CASIA-CSDB: Chemical Structure DataBase

1. Introduction

The Chemical Structure DataBase (CSDB) was constructed by the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences (CASIA). CASIA-CSDB is a large-scale chemical structure database containing 480,668 samples, each a chemical structure image paired with its SMILES string. To support fast design and evaluation, we also provide a subset called Mini-CASIA-CSDB, built by selecting images from each of the eight weight partitions of CASIA-CSDB at a rate of 20%, for a total of 97,309 samples. Both CASIA-CSDB and Mini-CASIA-CSDB are split into training, validation, and test sets at a ratio of 8 : 1 : 1.

Download: Mini-CASIA-CSDB.zip (416.7 MB)

Download: CASIA-CSDB.zip (4.29 GB)
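
The per-partition sampling and the 8 : 1 : 1 split described in the introduction could be sketched as follows. This is an illustration only, not the procedure used to build the released files: the function names, the (sample_id, partition) input layout, the random seed, and the rounding rules are all assumptions.

```python
import random
from collections import defaultdict


def sample_per_partition(samples, rate=0.2, seed=0):
    """Draw about `rate` of the sample ids from every partition.

    `samples` is a list of (sample_id, partition) pairs (hypothetical layout).
    """
    rng = random.Random(seed)
    by_partition = defaultdict(list)
    for sample_id, partition in samples:
        by_partition[partition].append(sample_id)

    subset = []
    for partition in sorted(by_partition):
        ids = by_partition[partition]
        # Assumed rounding rule: nearest integer to rate * partition size.
        subset.extend(rng.sample(ids, round(len(ids) * rate)))
    return subset


def split_8_1_1(ids, seed=0):
    """Shuffle the ids and cut them into train/validation/test at 8 : 1 : 1."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Sampling inside each partition, rather than over the whole set, keeps the subset's weight distribution close to that of the full database.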


2. Description of CASIA-CSDB

a. Database Constitution

CASIA-CSDB has 480,668 samples in total. Each sample is a chemical structure image paired with its SMILES string. The chemical structures are selected from eight distinct weight partitions of ChEMBL. Molecular weight alone, however, does not determine the complexity of a chemical structure; the length of the SMILES string reflects it better. Generally, the longer the SMILES string, the more complex the chemical structure of the sample and the more difficult it is to recognize. Fig. 1 shows 32 example images of chemical structures from the eight weight partitions.

Fig. 1 Chemical structure images of eight weight partitions of CASIA-CSDB.

b. Data Sources

We adopt ChEMBL as our data source. ChEMBL was extracted from more than 6,500 publications and 50 databases (such as BioAssays [1], DrugMatrix [2][3], TG-GATEs [4], and other well-known databases). It contains more than 2.1 million distinct chemical structures and 14 million activity values from more than 1.2 million analyses. ChEMBL divides all chemical molecules into eight types: antibody, cell, enzyme, gene, oligonucleotide, oligosaccharide, protein, and small molecule. In addition, according to molecular weight (from 0 g/mol to 12,546.32 g/mol), the chemical molecules are divided into eight weight partitions. Generally, the heavier the molecule, the more complex its structure.

c. Data Generation

ChEMBL contains eight types of molecules. Samples of cells, oligonucleotides, and oligosaccharides make up only a small proportion of the data, and their SMILES strings are particularly long and their molecular structures complex, which makes them difficult to recognize. These three types are therefore eliminated, and samples are selected only from the remaining five types. For each type of molecule, the data have eight weight partitions, and a certain proportion of the SMILES strings from each partition is extracted to construct the SMILES string set.

Although the types with complex structures are eliminated when building the SMILES string set, some extra-long SMILES strings remain in the set. The corresponding samples generally have complex structures and are difficult to recognize. Therefore, a length threshold is set, and samples whose SMILES string is longer than 200 are discarded.
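
The length threshold above amounts to a simple filter. The sketch below assumes that length is counted in characters of the SMILES string (the text does not say whether characters or tokens are meant); the names MAX_SMILES_LENGTH and filter_by_length are hypothetical.

```python
MAX_SMILES_LENGTH = 200  # threshold stated in the text


def filter_by_length(smiles_list, max_length=MAX_SMILES_LENGTH):
    """Keep only SMILES strings of at most `max_length` characters.

    Assumption: length is measured in characters, not tokens.
    """
    return [s for s in smiles_list if len(s) <= max_length]
```

A string of exactly 200 characters is kept, since only strings longer than 200 are discarded.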

After obtaining the SMILES string set, we convert each SMILES string into a corresponding chemical structure image using RDKit [5], an open-source cheminformatics toolkit that supports 2D and 3D molecular operations. Among other things, it can generate chemical structure descriptors for machine learning, render chemical structure images, calculate chemical structure similarity, and display 2D and 3D chemical structures. The originally generated chemical structure images are RGB images with a size of 300 × 300. To make batch processing easier and more efficient, all images are resized to 256 × 256 and converted from RGB to grayscale.
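
A minimal sketch of this rendering step is given below. It uses RDKit's Chem.MolFromSmiles and Draw.MolToImage together with Pillow's resize and convert; the function name, the default resampling filter, and the error handling are assumptions, and the drawing options actually used for the database are not documented here.

```python
def smiles_to_grayscale_image(smiles, render_size=300, final_size=256):
    """Render a SMILES string as a grayscale image of final_size x final_size."""
    # Imported inside the function so the module loads even without RDKit.
    from rdkit import Chem
    from rdkit.Chem import Draw

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when the SMILES string cannot be parsed.
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    # Render at 300 x 300 (returns a PIL image), as described in the text.
    image = Draw.MolToImage(mol, size=(render_size, render_size))
    # Resize to 256 x 256 with Pillow's default filter (an assumption).
    image = image.resize((final_size, final_size))
    # "L" is Pillow's 8-bit single-channel grayscale mode.
    return image.convert("L")
```

For example, smiles_to_grayscale_image("CCO").save("ethanol.png") would write a 256 × 256 grayscale rendering of ethanol; the output file name is illustrative.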

3. Condition of Use

  • The CASIA-CSDB: Chemical Structure DataBase, built by CASIA, is released for academic research free of cost under an agreement.
  • Commercial use of the databases is subject to charge. To license the databases for commercial use, please contact Fei Yin (fyin@nlpr.ia.ac.cn).

  • Reference

    A comprehensive description of the CSDB dataset is given in:

          Longfei Ding, Mengbiao Zhang, Fei Yin, Shuiling Zeng and Cheng-Lin Liu. A Large-Scale Database for Chemical Structure Recognition and Preliminary Evaluation, ICPR 2022.

    If this dataset helps you, please cite the paper above.

    [1] J. Cairns and J. R. Pratt, “The scientific basis of bioassays,” Hydrobiologia, vol. 188, no. 1, pp. 5–20, 1989.

    [2] B. Ganter, S. Tugendreich, C. I. Pearson, E. Ayanoglu, S. Baumhueter, K. A. Bostian, L. Brady, L. J. Browne, J. T. Calvin, G.-J. Day et al., “Development of a large-scale chemogenomics database to improve drug candidate selection and to understand mechanisms of chemical toxicity and action,” Journal of Biotechnology, vol. 119, no. 3, pp. 219–244, 2005.

    [3] G. Natsoulis, L. El Ghaoui, G. R. Lanckriet, A. M. Tolley, F. Leroy, S. Dunlea, B. P. Eynon, C. I. Pearson, S. Tugendreich, and K. Jarnagin, “Classification of a large microarray data set: algorithm comparison and analysis of drug signatures,” Genome Research, vol. 15, no. 5, pp. 724–736, 2005.

    [4] Y. Igarashi, N. Nakatsu, T. Yamashita, A. Ono, Y. Ohno, T. Urushidani, and H. Yamada, “Open tg-gates: a large-scale toxicogenomics database,” Nucleic Acids Research, vol. 43, no. D1, pp. D921–D927, 2015.

    [5] G. Landrum, “Rdkit documentation,” Release, vol. 1, no. 1-79, p. 4, 2013.


    Contact

    Fei Yin (fyin@nlpr.ia.ac.cn)

    National Laboratory of Pattern Recognition (NLPR)

    Institute of Automation of Chinese Academy of Sciences

    95 Zhongguancun East Road, Beijing 100190, P.R. China